[1] O. Elisha and S. Dekel, Wavelet decompositions of Random Forests - smoothness analysis, sparse approximation and applications, JMLR 17 (2016).
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, LabelEncoder
# models
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import precision_score, recall_score
from xgboost import XGBClassifier, XGBRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn import svm
from sklearn.linear_model import LogisticRegression, LinearRegression
# shap
import shap
# skater
from skater.core.explanations import Interpretation
from skater.model import InMemoryModel
# Vis.
from plotnine import *
from matplotlib.colors import rgb2hex
from matplotlib.cm import get_cmap
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Random Forest, introduced by Breiman in 2001, is an ensemble algorithm based on decision trees that is widely used for both regression and classification problems.
For a better understanding of the Random Forest (RF) algorithm, we will start at its roots: bagging. Bagging (bootstrap aggregating) uses bootstrap resampling to generate multiple versions of a predictor, each fitted on a different sample, and combines them in order to obtain better generalization. This approach typically helps unstable methods, with decision trees as the leading example.
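As a quick illustration of why bagging helps an unstable learner, the sketch below (our own illustrative code and synthetic data, not part of the analysis that follows) fits several fully grown regression trees on bootstrap resamples and averages their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=300)
X_test = rng.uniform(-3, 3, size=(300, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(scale=0.3, size=300)

# Bagging by hand: B trees, each fit on a bootstrap resample, predictions averaged
B = 25
all_preds = []
for b in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
    tree = DecisionTreeRegressor(random_state=b).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))
bagged_pred = np.mean(all_preds, axis=0)

# compare against a single fully grown (high-variance) tree
single_pred = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_test)
mse = lambda p: np.mean((p - y_test) ** 2)
```

On this noisy data the averaged predictor has a lower test MSE than the single tree, because averaging over bootstrap resamples reduces the variance of the unstable base learner.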
Consider a general regression problem: denote a train set $T$ of samples $\left\{ x_{i},f\left(x_{i}\right)=y_{i}\right\} _{i=1}^{N}\subset\varOmega_{0}\times\mathbb{R}$, where $\varOmega_{0}\subset\mathbb{R}^{n}$. For a subdomain $\varOmega\subseteq\varOmega_{0}$, let $c_{\varOmega}=\frac{1}{\#\left\{ x_{i}\in\varOmega\right\} }\sum_{x_{i}\in\varOmega}f\left(x_{i}\right)$ denote the mean value of $f$ on $\varOmega$.
We can restrict our attention to regression, since it is easy to convert any multiclass classification problem to a regression problem in the following way: denote a train set $T$ of samples $\left\{ x_{i},y_{i}\right\} _{i=1}^{N}$, where each $x_{i}\in\mathbb{R}^{n}$ is assigned through a function $f$ to a class $y_{i}=C\left(x_{i}\right)$. This problem can be converted to a numeric setting by assigning to each class value $C$ a vertex of the regular simplex with $L$ vertices in $\mathbb{R}^{L-1}$. Thus, our classification problem is translated to a regression problem with train data $\left\{ x_{i},C\left(x_{i}\right)\right\} _{i=1}^{N}\subset\varOmega_{0}\times\mathbb{R}^{L-1}$. In this case, $c_{\varOmega}$ is the calculated mean over any subdomain, and is in fact a point inside the simplex.
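To make the class-to-simplex encoding concrete, here is a small sketch (the function name and construction are our own): it builds the $L$ vertices of a regular simplex in $\mathbb{R}^{L-1}$ by centering the standard basis of $\mathbb{R}^{L}$ and rewriting the centered vectors in an orthonormal basis of the hyperplane $\sum_{k}x_{k}=0$, which preserves all pairwise distances:

```python
import numpy as np

def simplex_vertices(L):
    """Return the L vertices of a regular simplex embedded in R^{L-1}.

    The centered standard basis vectors of R^L are pairwise sqrt(2) apart
    and all lie in the hyperplane {x : sum(x) = 0}; expressing them in an
    orthonormal basis of that hyperplane keeps the distances intact.
    """
    E = np.eye(L) - 1.0 / L      # rows: centered basis vectors, each sums to 0
    Q, _ = np.linalg.qr(E.T)     # first L-1 columns span the hyperplane
    return E @ Q[:, :L - 1]      # coordinates in R^{L-1}, shape (L, L-1)

V = simplex_vertices(4)          # e.g. 4 classes -> 4 equidistant points in R^3
```

Each class label is then replaced by its vertex, and the tree means $c_{\varOmega}$ become points inside this simplex.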
The RF algorithm grows each tree by recursively splitting the data domain, as follows.
The classic algorithm sets the split at each node using axis-aligned binary subdivisions, chosen greedily. More formally, denote by $\varOmega$ our current subdomain of the original domain $\varOmega_{0}$. The next split is defined by the following minimization problem:
\begin{equation*} \underset{\varOmega^{'}\cup\varOmega^{''}=\varOmega}{min}\sum_{x_{i}\in\varOmega^{'}}\left(f\left(x_{i}\right)-c_{\varOmega^{'}}\right)^{2}+\sum_{x_{i}\in\varOmega^{''}}\left(f\left(x_{i}\right)-c_{\varOmega^{''}}\right)^{2} \end{equation*}Then, Elisha and Dekel show that this minimization problem is equivalent to the maximization problem
\begin{equation*} \underset{\varOmega^{'}\cup\varOmega^{''}=\varOmega}{max}\left\Vert \psi_{\varOmega^{'}}\right\Vert _{2}^{2}+\left\Vert \psi_{\varOmega^{''}}\right\Vert _{2}^{2} \end{equation*}where $\psi_{\varOmega^{'}}$ is the geometric wavelet, defined as \begin{equation} \psi_{\varOmega^{'}}=I_{\varOmega^{'}}\left(c_{\varOmega^{'}}-c_{\varOmega}\right) \end{equation} and $\left\Vert \psi_{\varOmega^{'}}\right\Vert _{2}^{2}=\sum_{x_{i}\in\varOmega^{'}}\left\Vert c_{\varOmega^{'}}-c_{\varOmega}\right\Vert _{2}^{2}=\left\Vert c_{\varOmega^{'}}-c_{\varOmega}\right\Vert _{2}^{2}\,\#\left\{ x_{i}\in\varOmega^{'}\right\}$.
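The equivalence follows from the variance decomposition $\sum_{x_{i}\in\varOmega}\left(f\left(x_{i}\right)-c_{\varOmega}\right)^{2}=\mathrm{SSE}+\left\Vert \psi_{\varOmega^{'}}\right\Vert _{2}^{2}+\left\Vert \psi_{\varOmega^{''}}\right\Vert _{2}^{2}$: the left side does not depend on the split, so minimizing the SSE is the same as maximizing the wavelet norms. A brute-force check on 1-D toy data (our own sketch, not the paper's code):

```python
import numpy as np

def split_scores(x, y):
    """For every threshold of sorted 1-D data, return the CART SSE and the
    wavelet-norm sum ||psi_left||^2 + ||psi_right||^2 of the split."""
    y = y[np.argsort(x)]
    c, N = y.mean(), len(y)
    sse = np.empty(N - 1)
    wnorm = np.empty(N - 1)
    for k in range(1, N):
        left, right = y[:k], y[k:]
        sse[k - 1] = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        wnorm[k - 1] = k * (left.mean() - c) ** 2 + (N - k) * (right.mean() - c) ** 2
    return sse, wnorm

rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = np.where(x < 0.4, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
sse, wnorm = split_scores(x, y)
```

For every candidate split the two scores sum to the (constant) total variance, so both criteria select the same threshold.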
The wavelet representation of $f$ on a single tree $t$ is $f=\sum_{\varOmega\in t}\psi_{\varOmega}$. Thus, we can derive the wavelet representation of the entire RF as a weighted combination of the representations of the $B$ trees \begin{equation} \tilde{f}\left(x\right)=\sum_{j=1}^{B}\sum_{\varOmega\in t_{j}}\omega_{j}\psi_{\varOmega}\left(x\right) \end{equation} For simplicity, assume $\omega_{j}=\frac{1}{B}$.
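We can verify the single-tree representation $f=\sum_{\varOmega\in t}\psi_{\varOmega}$ directly on a fitted sklearn tree: along the decision path of $x$ the wavelet sum telescopes to $c_{\varOmega_{0}}$ plus $\left(c_{\text{child}}-c_{\text{parent}}\right)$ at every split, which is exactly the leaf mean the tree predicts. A sketch of ours using the `tree_` internals:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
Xd = rng.uniform(size=(200, 1))
yd = np.sin(4 * Xd[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xd, yd)
t = tree.tree_

def wavelet_prediction(x):
    """Sum the wavelets along x's decision path: the root mean c_{Omega_0}
    plus (c_child - c_parent) at every split the point passes through."""
    node = 0
    pred = t.value[0, 0, 0]                  # root wavelet = c_{Omega_0}
    while t.children_left[node] != -1:       # descend until a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        pred += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return pred

preds = np.array([wavelet_prediction(x) for x in Xd])
```

The telescoping wavelet sum reproduces `tree.predict` exactly, since for a `DecisionTreeRegressor` the stored `value` of each node is the mean of the targets in its domain.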
Now, following Elisha and Dekel, we can derive a "pruned" representation of the ensemble by ordering the wavelet norms and keeping only the most significant ones in the representation: \begin{equation} \left\Vert \psi_{\varOmega_{k_{1}}}\right\Vert _{2}\geq\left\Vert \psi_{\varOmega_{k_{2}}}\right\Vert _{2}\geq\left\Vert \psi_{\varOmega_{k_{3}}}\right\Vert _{2}\geq\ldots \end{equation} Unlike most pruning methods, this process is applied to the entire ensemble at once.
The final prediction is therefore $\hat{f}_{M}\left(x\right)=\frac{1}{B}\sum_{m=1}^{M}\psi_{\varOmega_{k_{m}}}\left(x\right)$.
The hyper-parameter $M$, which is the number of wavelets in the final representation, is selected using a predefined validation set. More specifically, denote our train data $T$ and our validation data $V$, both drawn from our original data. For each candidate value of $M$, we derive the prediction of our model $\hat{f}_{M}\left(x\right)$ on each of the observations of the validation data. The chosen value of $M$ is the one that yields the minimal approximation error (for example, MSE) on the validation set. Generally, we would expect that as we increase $M$, the error on the train data decreases. The error on the validation data will also decrease at first, but from a certain point it may start to increase (i.e., we start to overfit the train data).
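The pruning and the selection of $M$ can be emulated on a single fitted sklearn tree: rank all wavelets by norm, and for each candidate $M$ keep only the top-$M$ wavelets in the telescoping sum along the decision path, choosing the $M$ with the smallest validation MSE. This is our own illustrative code, not the paper's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
Xtr = rng.uniform(size=(300, 1))
ytr = np.sin(4 * Xtr[:, 0]) + rng.normal(scale=0.2, size=300)
Xval = rng.uniform(size=(100, 1))
yval = np.sin(4 * Xval[:, 0]) + rng.normal(scale=0.2, size=100)

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(Xtr, ytr)
t = tree.tree_

# ||psi_Omega||_2^2 = #{x_i in Omega} * (c_Omega - c_parent)^2 for every non-root node
parent = {t.children_left[n]: n for n in range(t.node_count) if t.children_left[n] != -1}
parent.update({t.children_right[n]: n for n in range(t.node_count) if t.children_right[n] != -1})
norms = {n: t.n_node_samples[n] * (t.value[n, 0, 0] - t.value[p, 0, 0]) ** 2
         for n, p in parent.items()}
ranked = sorted(norms, key=norms.get, reverse=True)  # wavelets by decreasing norm

def predict_pruned(x, keep):
    """Telescoping wavelet sum that only adds the wavelets in `keep`."""
    node, pred = 0, t.value[0, 0, 0]
    while t.children_left[node] != -1:
        child = (t.children_left[node] if x[t.feature[node]] <= t.threshold[node]
                 else t.children_right[node])
        if child in keep:
            pred += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return pred

# validation error as a function of M; pick the minimizer
val_mse = {}
for M in range(0, len(ranked) + 1, 5):
    keep = set(ranked[:M])
    preds = np.array([predict_pruned(x, keep) for x in Xval])
    val_mse[M] = np.mean((preds - yval) ** 2)
best_M = min(val_mse, key=val_mse.get)
```

Keeping all wavelets recovers the original tree, while the chosen `best_M` typically sits well below the full wavelet count.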
Finally, the feature importance of feature $i$, for $i=1,\ldots,n$, can be calculated as \begin{equation} s_{i}^{\tau}=\frac{1}{B}\sum_{j=1}^{B}\sum_{\varOmega\in t_{j}\wedge v_{i}}\left\Vert \psi_{\varOmega}\right\Vert _{2}^{\tau} \end{equation} for $\tau>0$, where $v_{i}$ is the set of child domains formed by partitioning their parent domain along the $i^{th}$ variable.
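For a single tree the score $s_{i}^{\tau}$ can be accumulated directly: for every internal node that splits on variable $i$, add $\left\Vert \psi_{\varOmega}\right\Vert _{2}^{\tau}$ of its two children. Our own sketch with $\tau=1$, assuming the `tree_` layout of scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def wavelet_importance(tree, n_features, tau=1.0):
    """Accumulate ||psi_Omega||_2^tau of the children of every node that
    splits on feature i, giving the wavelet-based score s_i^tau."""
    t = tree.tree_
    s = np.zeros(n_features)
    for node in range(t.node_count):
        if t.children_left[node] == -1:      # leaf: no split, no children
            continue
        for child in (t.children_left[node], t.children_right[node]):
            norm2 = t.n_node_samples[child] * (t.value[child, 0, 0] - t.value[node, 0, 0]) ** 2
            s[t.feature[node]] += norm2 ** (tau / 2.0)
    return s

rng = np.random.default_rng(3)
Xf = rng.uniform(size=(500, 2))
yf = 3.0 * Xf[:, 0] + rng.normal(scale=0.1, size=500)  # only feature 0 is informative
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xf, yf)
scores = wavelet_importance(reg, n_features=2)
```

As expected, the informative feature accumulates a much larger wavelet score than the noise feature.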
def compute_vi(clf, X, y, features, k_fold):
    """
    Compute VI over all k folds.

    Parameters
    ----------
    clf: Classifier
    X: X matrix
    y: labels
    features: list of features
    k_fold: `sklearn.model_selection.KFold` object

    Returns
    -------
    pd.DataFrame with variables and importance score
    """
    fold_dfs = []
    for i, (train_index, test_index) in enumerate(k_fold.split(X)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        fitted_clf = clf.fit(X_train, y_train)
        fold_dfs.append(pd.DataFrame({'imp': fitted_clf.feature_importances_,
                                      'variable': features,
                                      'fold': i}))
    # DataFrame.append was removed in pandas 2.0; concat the folds instead
    return pd.concat(fold_dfs, ignore_index=True)
def vi_score(clf, X, y, k_fold, model_name,
             vi, ks, scoring = {'acc': 'accuracy',
                                'recall': lambda cls, X, y:
                                    recall_score(y, cls.predict(X), average='micro'),
                                'precision': lambda cls, X, y:
                                    precision_score(y, cls.predict(X), average='micro')}):
    """
    Compute the scoring for a given model

    Parameters
    ----------
    clf: Classifier
    X: X matrix
    y: labels
    k_fold: `sklearn.model_selection.KFold` object
    model_name: Model name
    vi: Sorted list of variables
    ks: list of k in [1, K] where K is the total number of features
    scoring: Scoring metrics to compute

    Returns
    -------
    pd.DataFrame with scoring for each fold and k in ks
    """
    score_dfs = []
    for k in ks:
        vars_list = vi[:k]
        curr_X = X[vars_list]
        curr_score_df = kfold_score(clf, curr_X, y, k_fold, model_name, scoring)
        curr_score_df['k'] = k
        score_dfs.append(curr_score_df)
    return pd.concat(score_dfs, ignore_index=True)
def kfold_score(clf, X, y, k_fold, model_name, scoring):
    """
    Compute the scoring for a given model

    Parameters
    ----------
    clf: Classifier
    X: X matrix
    y: labels
    k_fold: `sklearn.model_selection.KFold` object
    model_name: Model name
    scoring: Scoring metrics to compute

    Returns
    -------
    pd.DataFrame with scoring for each fold
    """
    score = cross_validate(clf, X, y,
                           cv=k_fold, scoring=scoring,
                           return_train_score=False)
    score_df = pd.DataFrame(score)
    score_df['model'] = model_name
    return score_df
def compute_shap_vi(mod, X, y, features, k_fold, reg=False):
    """
    Compute SHAP-VI over all k folds.

    Parameters
    ----------
    mod: Classifier/Regressor
    X: X matrix
    y: labels
    features: list of features
    k_fold: `sklearn.model_selection.KFold` object
    reg: True if regression

    Returns
    -------
    pd.DataFrame with variables and importance score
    """
    fold_dfs = []
    for i, (train_index, test_index) in enumerate(k_fold.split(X)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        fitted_mod = mod.fit(X_train, y_train)
        # shap vi
        explainer = shap.TreeExplainer(fitted_mod)
        shap_vals = explainer.shap_values(X_train)
        if reg:
            mean_abs_shap = np.abs(shap_vals).mean(0)
            sum_abs_shap = np.abs(shap_vals).sum(0)
        else:
            # multiclass: shap_vals is a list with one array per class
            mean_abs_shap = np.array([np.abs(shap_vals[c]).mean(0)
                                      for c in range(len(shap_vals))]).mean(0)
            sum_abs_shap = np.array([np.abs(shap_vals[c]).mean(0)
                                     for c in range(len(shap_vals))]).sum(0)
        fold_dfs.append(pd.DataFrame({'imp_mean': mean_abs_shap,
                                      'imp_sum': sum_abs_shap,
                                      'variable': features,
                                      'fold': i}))
    return pd.concat(fold_dfs, ignore_index=True)
def compare_orderings(clf,
                      model_order,
                      X,
                      y,
                      k_fold,
                      scoring,
                      ks = range(1, 10)):
    """
    Parameters
    ----------
    model_order: 'vi_method_name':vi_order (string:list dict)
    """
    score_dfs = []
    for model, var_list in model_order.items():
        print("Calc. VI score for {}...".format(model))
        score_dfs.append(vi_score(clf,
                                  X = X, y = y,
                                  k_fold = k_fold,
                                  model_name = model,
                                  vi = var_list,
                                  ks = ks,
                                  scoring = scoring))
    return pd.concat(score_dfs, ignore_index=True)
def plot_vi(vi_df, metric, title = "", figsize = (20, 8)):
    vi_df.groupby(['k', 'model']).mean()[metric].reset_index().\
        pivot_table(values=metric, index='k', columns='model', aggfunc=lambda x: x).\
        plot(figsize = figsize, title = title)
Human Activity Recognition (HAR) has wide applications in healthcare (remote monitoring of patients), smart environments (smart homes, IoT etc.), sports (fitness applications and monitoring), and many more.
Nowadays, when nearly every smartphone (or smartwatch) comes equipped with built-in inertial sensors, such as accelerometers and gyroscopes, the technology for HAR is available to almost everybody.
Our goal is to build an HAR classification system, using accelerometer and gyroscope data generated by the user’s cell phone.
The data is taken from the Human Activity Recognition database, built from the recordings of 30 subjects performing activities of daily living while carrying a waist-mounted smartphone with embedded inertial sensors.[1](#1)
We will try and test different predictive algorithms, and estimate the models' accuracy on an independent test set.
The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities:
The activities were performed while wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, the researchers captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments were video-recorded so that the researchers could annotate the data manually.[2](#2)
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
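The windowing scheme described above (2.56 s at 50 Hz = 128 readings per window, 50% overlap = a 64-sample hop) can be sketched as follows; the function name and the dummy signal are ours:

```python
import numpy as np

def sliding_windows(signal, fs=50, win_sec=2.56, overlap=0.5):
    """Fixed-width sliding windows: 2.56 s at 50 Hz = 128 samples, 50% overlap."""
    win = int(fs * win_sec)            # 128 samples per window
    step = int(win * (1 - overlap))    # 64-sample hop between windows
    n = (len(signal) - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n)])

sig = np.arange(640.0)                 # 12.8 s of a dummy 50 Hz signal
W = sliding_windows(sig)               # one row per window
```

Each of the resulting 128-sample windows is then summarized into a feature vector from the time and frequency domains.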
The obtained data set, which consists of 562 features (explanatory variables) and the response, has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and the remaining 30% for the test data.
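Note that the partitioning was done by volunteer, so all windows from a given subject fall on the same side of the split. A subject-wise split like this can be reproduced with `GroupShuffleSplit`; the toy arrays below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

subjects = np.repeat(np.arange(30), 10)           # 30 volunteers, 10 windows each
X_demo = np.random.default_rng(0).normal(size=(300, 4))

# 70/30 split at the *group* (volunteer) level, not the row level
gss = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, test_idx = next(gss.split(X_demo, groups=subjects))
```

Splitting by subject prevents windows from the same person leaking into both sets, which would otherwise inflate the test accuracy.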
1. The data is available at the UC Irvine Machine Learning Repository (link to data)
2. Follow this link to watch a video featuring one of the participants performing the 6 activities recorded in the experiment.
har_test = pd.read_csv("data/har-test.csv")
har_train = pd.read_csv("data/har-train.csv")
print("Train size: {0}, Test size: {1}".format(har_train.shape, har_test.shape))
har_data = pd.concat((har_train, har_test), ignore_index = True)
print("Full data size: {}".format(har_data.shape))
har_data.head()
clean_colnames = [col.split('-')[0].split('(')[0] for col in har_data.columns]
pd.Series(clean_colnames).value_counts().sort_values().\
plot(kind = 'barh', figsize=(15, 8))
We will first examine the data using t-SNE, projecting it into $\mathbb{R}^2$.
# remove subject, activity
subject, activity = har_data.subject, har_data.Activity
har_data.drop(columns=['subject', 'Activity'], inplace = True)
# sample
p = .25
n = activity.shape[0]
sample_size = int(n * p)
idxs = np.random.choice(n, size = sample_size, replace = False)  # sample without replacement
tsne_data = har_data.iloc[idxs].copy()
tsne_activity = activity.iloc[idxs]
tsne_subject = subject.iloc[idxs]
pca = PCA(n_components=0.9, random_state=3)
scl = StandardScaler()
tsne_data = scl.fit_transform(tsne_data)
tsne_data = pca.fit_transform(tsne_data)
tsne = TSNE(random_state=3, perplexity=40)
tsne_data = tsne.fit_transform(tsne_data)
# Create subplots
fig, axarr = plt.subplots(2, 1, figsize=(18,10))
### Plot Activities
# Get colors
n = tsne_activity.unique().shape[0]
colormap = get_cmap('viridis')
colors = [rgb2hex(colormap(col)) for col in np.arange(0, 1.01, 1/(n-1))]
label_counts = tsne_activity.value_counts()
# Plot each activity
for i, group in enumerate(label_counts.index):
# Mask to separate sets
mask = (tsne_activity==group).values
axarr[0].scatter(x=tsne_data[mask][:,0], y=tsne_data[mask][:,1],
c=colors[i], alpha=0.5, label=group)
axarr[0].set_title('TSNE: Activity Visualisation')
axarr[0].legend()
### Plot Subjects
# Get colors
n = tsne_subject.unique().shape[0]
colormap = get_cmap('gist_ncar')
colors = [rgb2hex(colormap(col)) for col in np.arange(0, 1.01, 1/(n-1))]
# Plot each participant
for i, group in enumerate(tsne_subject.unique()):
# Mask to separate sets
mask = (tsne_subject==group).values
axarr[1].scatter(x=tsne_data[mask][:,0], y=tsne_data[mask][:,1],
c=colors[i], alpha=0.5, label=group)
axarr[1].set_title('TSNE: Participant Visualisation')
plt.show()
Nice! We can see a pretty good separation in the upper plot. The lower plot seems more blurry, but we can still see some "islands" of color, meaning that different participants tend to perform some activities quite differently from others.
For each model and for each replication we will use the first $k$ best features to construct a model.
We will use 2 datasets:
We will compare the following models (and their already implemented FI):
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations. While SHAP values can explain the output of any machine learning model, the authors have developed a high-speed exact algorithm for tree ensemble methods (paper).
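To illustrate the game-theoretic quantity SHAP approximates, here is a brute-force computation of exact Shapley values for a tiny toy model (the model, the "missing feature = background value" convention, and all names are our own illustration, not the SHAP library's algorithm):

```python
import math
from itertools import combinations
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values by enumerating all feature coalitions.
    Features outside the coalition S are replaced by their background value."""
    n = len(x)

    def v(S):                       # value of coalition S
        z = background.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
                phi[i] += weight * (v(S + (i,)) - v(S))
    return phi

f = lambda z: 2.0 * z[0] + 3.0 * z[0] * z[1]   # toy model with an interaction term
x = np.array([1.0, 2.0])
bg = np.array([0.0, 0.0])
phi = shapley_values(f, x, bg)
```

The efficiency property holds by construction: the values sum to $f(x)-f(\text{background})$, the same local-accuracy guarantee SHAP provides. Here the interaction term $3x_0x_1$ is split evenly between the two features, giving $\phi = (5, 3)$.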
We will use the sklearn.model_selection.GridSearchCV function (with 5-fold CV) for hyper-parameter tuning.
# our 5-fold generator
kcv = KFold(n_splits=5, random_state=3, shuffle=True)
# number of trees for all models
NTREE = 100
# ks - number of most imp features
ks = [1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100]
X = har_data.copy()
y = activity.copy()
We start with a simple RF model. We use grid search with 5-fold CV to select proper parameters.
rf_clf = GridSearchCV(RandomForestClassifier(n_estimators = NTREE, n_jobs = -1),
param_grid = {'max_leaf_nodes':[100, 200, 300],
'min_samples_split' : [.001, .005, .01],
'max_features' : ['log2', 'sqrt']},
cv = 5, verbose = 1)
rf_clf_grid_search = rf_clf.fit(X, y)
print(rf_clf_grid_search.best_params_)
Let us examine the feature importance of the RF model (trained using the best selected parameters):
Find variable importance:
vi_rf_clf_df = compute_vi(rf_clf_grid_search.best_estimator_, X.values, y.values, X.columns, kcv)
rf_clf_vi = vi_rf_clf_df.groupby('variable').imp.mean()
plt.figure(figsize=(15, 8))
rf_clf_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("RF - Variable Importance")
plt.show()
We use the selected parameters to fit a model and compute variable importance.
# sorted variables - according to our RF model
sorted_rf_clf_vars = rf_clf_vi.sort_values(ascending = False).index.tolist()
shap.initjs()
explainer = shap.TreeExplainer(rf_clf_grid_search.best_estimator_)
shap_values = explainer.shap_values(X)
We first examine the feature importance for each class on the full dataset.
shap.summary_plot(shap_values, X, plot_type = 'bar')
Next, we compute the 5-fold averaged FI based on the SHAP method
vi_shap_df = compute_shap_vi(rf_clf_grid_search.best_estimator_, X.values, y.values, X.columns, kcv)
rf_shap_vi = vi_shap_df.groupby('variable').imp_sum.mean()
plt.figure(figsize=(15, 8))
rf_shap_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("RF with SHAP - Variable Importance")
plt.show()
sorted_rf_shap_vars = rf_shap_vi.sort_values(ascending = False).index.tolist()
params = {
'subsample': [0.8, 1.0],
'max_depth': [5, 9],
'learning_rate':[.001, .01]
}
xgb = XGBClassifier(n_estimators = NTREE)
grid_xgb = GridSearchCV(xgb,
param_grid = params,
cv = 5, verbose = 1,
n_jobs=-1)
xgb_grid_search = grid_xgb.fit(X, y)
print(xgb_grid_search.best_params_)
The 5-fold averaged FI of the xgboost model:
vi_xgb_df = compute_vi(xgb_grid_search.best_estimator_, X.values, y.values, X.columns, kcv)
xgb_vi = vi_xgb_df.groupby('variable').imp.mean()
plt.figure(figsize=(15, 8))
xgb_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("XGBoost - Variable Importance")
plt.show()
sorted_xgb_vars = xgb_vi.sort_values(ascending = False).index.tolist()
And now, the 5-fold averaged FI based on the SHAP method applied to xgboost:
vi_shap_xgb_df = compute_shap_vi(xgb_grid_search.best_estimator_, X.values, y.values, X.columns, kcv)
xgb_shap_vi = vi_shap_xgb_df.groupby('variable').imp_sum.mean()
plt.figure(figsize=(15, 8))
xgb_shap_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("XGB with SHAP - Variable Importance")
plt.show()
sorted_xgb_shap_vars = xgb_shap_vi.sort_values(ascending = False).index.tolist()
Skater's feature importance implementation is based on an information-theoretic criterion, measuring the entropy in the change of predictions given a perturbation of a given feature. The intuition is that the more a model’s decision criteria depend on a feature, the more the predictions change as a function of perturbing that feature.
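Skater's exact criterion is entropy-based, but the core idea can be sketched with plain permutation importance: shuffle one column and measure how much the model's score drops. The data and names below are illustrative, not Skater's implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# with shuffle=False, the 2 informative features are columns 0 and 1
Xp, yp = make_classification(n_samples=400, n_features=5, n_informative=2,
                             n_redundant=0, shuffle=False, random_state=0)
perm_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xp, yp)

rng = np.random.default_rng(0)
base_score = perm_clf.score(Xp, yp)
drops = []
for j in range(Xp.shape[1]):
    X_shuffled = Xp.copy()
    perm = rng.permutation(len(Xp))
    X_shuffled[:, j] = X_shuffled[perm, j]   # perturb a single feature
    drops.append(base_score - perm_clf.score(X_shuffled, yp))
```

Shuffling an informative column destroys its relationship with the labels and the score drops sharply, while shuffling a noise column barely matters.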
interpreter = Interpretation(X, feature_names=X.columns)
model = InMemoryModel(rf_clf_grid_search.best_estimator_.predict_proba, examples=X)
perm_feature_importance = interpreter.feature_importance.feature_importance(model)
plt.figure(figsize=(15, 8))
perm_feature_importance.sort_values(ascending=True).tail(20).plot(kind = 'barh')
sorted_perm_vars = perm_feature_importance.sort_values(ascending=False).index
wv_rf_vi = pd.read_csv("aws_outputs/mean_VI_clf.txt", header=None)
wv_rf_vi.columns = ['vi']
wv_rf_vi_series = pd.Series(wv_rf_vi.values.reshape(-1), index = X.columns)
wv_rf_vi.head()
wv_rf_vi_series.sort_values(ascending=True).tail(20).plot(kind = 'barh')
sorted_wv_rf_vi = wv_rf_vi_series.sort_values(ascending=False).index
full_vi_clf_df = pd.DataFrame({'variable':X.columns, 'rf':rf_clf_vi[X.columns],
'rf_shap':rf_shap_vi[X.columns], 'xgb':xgb_vi[X.columns],
'xgb_shap':xgb_shap_vi[X.columns],
'perm':perm_feature_importance[X.columns],
'wv_rf':wv_rf_vi_series[X.columns]}).reset_index(drop=True)
full_vi_clf_df.head()
full_vi_clf_df.to_csv("results/clf_vi.csv", index=False)
full_vi_clf_df = pd.read_csv("results/clf_vi.csv")
full_vi_clf_df.head()
model_order = {'rf':sorted_rf_clf_vars,
'rf_shap':sorted_rf_shap_vars,
'xgb':full_vi_clf_df.sort_values('xgb', ascending = False).variable,
'xgb_shap':sorted_xgb_shap_vars,
'perm':sorted_perm_vars,
'wv_rf':sorted_wv_rf_vi
}
# # same
# model_order = {'rf':full_vi_clf_df.sort_values('rf', ascending=False).variable,
# 'rf_shap':full_vi_clf_df.sort_values('rf_shap', ascending=False).variable,
# 'xgb':full_vi_clf_df.sort_values('xgb', ascending=False).variable,
# 'xgb_shap':full_vi_clf_df.sort_values('xgb_shap', ascending=False).variable,
# 'perm':full_vi_clf_df.sort_values('perm', ascending=False).variable,
# 'wv_rf':full_vi_clf_df.sort_values('wv_rf', ascending=False).variable,
# }
pd.DataFrame(model_order).head()
# scoring metrics
scoring = {'acc' : 'accuracy',
'recall':lambda cls, X, y :
recall_score(y, cls.predict(X), average = 'micro'),
'precision':lambda cls, X, y :
precision_score(y, cls.predict(X), average = 'micro')}
# compute vi score for all orderings
vi_score_clf_rf_df = compare_orderings(rf_clf_grid_search.best_estimator_,
model_order,
X, y, kcv,
ks = ks,
scoring = scoring)
vi_score_clf_rf_df['test_err'] = 1 - vi_score_clf_rf_df['test_acc']
RF model based on the top k features chosen by all FI methods
plt.style.use('seaborn-deep')
plot_vi(vi_score_clf_rf_df, 'test_err')
# compute vi score for all orderings
vi_score_clf_svm_df = compare_orderings(svm.SVC(decision_function_shape='ovo', gamma = .01),
model_order,
X, y, kcv,
ks = ks,
scoring = scoring)
SVM model based on the top k features chosen by all FI methods
plt.style.use('seaborn-talk')
vi_score_clf_svm_df['test_err'] = 1 - vi_score_clf_svm_df['test_acc']
plot_vi(vi_score_clf_svm_df, 'test_err')
from sklearn.linear_model import LogisticRegression
# compute vi score for all orderings
vi_score_clf_lr_df = compare_orderings(LogisticRegression(),
model_order,
X, y, kcv,
ks = ks,
scoring = scoring)
Logistic regression model based on the top k features chosen by all FI methods
plt.style.use('seaborn-talk')
vi_score_clf_lr_df['test_err'] = 1 - vi_score_clf_lr_df['test_acc']
plot_vi(vi_score_clf_lr_df, 'test_err')
rename = {'model':'method', 'test_acc':'ACC', 'test_recall':'Recall',
'test_err':'Error', 'test_precision':'Precision'}
vi_score_clf_svm_df.rename(columns = rename, inplace = True)
vi_score_clf_svm_df['model'] = 'SVM'
vi_score_clf_rf_df.rename(columns = rename, inplace = True)
vi_score_clf_rf_df['model'] = 'RF'
vi_score_clf_lr_df.rename(columns = rename, inplace = True)
vi_score_clf_lr_df['model'] = 'LR'
columns = ['model', 'method', 'k', 'ACC', 'Error'] # , 'Precision', 'Recall'
clf_results_full = pd.concat((vi_score_clf_rf_df[columns],
vi_score_clf_svm_df[columns],
vi_score_clf_lr_df[columns]), ignore_index=True)
clf_results_full.head()
clf_results_full.to_csv("results/clf_results_full.csv", index=False)
Abalones are marine snails. Their taxonomy puts them in the family Haliotidae which contains only one genus, Haliotis, which once contained six subgenera. These subgenera have become alternate representations of Haliotis. The number of species recognized worldwide ranges between 30 and 130 with over 230 species-level taxa described. The most comprehensive treatment of the family considers 56 species valid, with 18 additional subspecies.
Abalones can be found along the coasts of almost every continent. Abalones are consumed as food around the world by different cultures. However, the brightness and variety of colors of the interior side of their shells makes them a valuable object of adornment and decoration.
The age of an abalone can be determined from the number of rings in its shell. However, counting the number of rings in an abalone shell is an expensive method. Thus, one possible solution is to predict the number of rings of an abalone from characteristics like height, diameter, length, and weight measurements.
The Abalone dataset consists of the following attributes:

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
column_names = ["sex", "length", "diameter", "height", "whole weight",
"shucked weight", "viscera weight", "shell weight", "rings"]
abalone_data = pd.read_csv(path, names=column_names)
print("Number of samples: %d" % len(abalone_data))
abalone_data.head()
Let's take a look at the correlations and histograms of the model features
sns.pairplot(abalone_data, hue="sex")
plt.figure(figsize = (15,8))
sns.distplot(abalone_data.rings, color="#7aadff",
hist=False, rug=False, kde_kws={"shade": True})
# our 5-fold generator
kcv = KFold(n_splits=5, random_state=4, shuffle=True)
# number of trees for all models
NTREE = 200
# ks - number of most imp features
ks = range(1,11)
X = pd.concat([pd.get_dummies(abalone_data.sex),
abalone_data.iloc[:,(abalone_data.columns != "rings") & (abalone_data.columns != "sex")]], axis = 1).copy()
y = abalone_data["rings"].copy()
rf_reg_grid = GridSearchCV(RandomForestRegressor(n_estimators = NTREE, n_jobs = -1),
param_grid = {'max_leaf_nodes':[100, 200, 300],
'min_samples_split' : [.001, .005, .01],
'max_features' : ['log2', 'sqrt']},
cv = kcv, verbose = 1)
rf_reg_grid_search = rf_reg_grid.fit(X, y)
print(rf_reg_grid_search.best_params_)
rf_reg_vi_df = compute_vi(rf_reg_grid_search.best_estimator_, X.values, y.values, X.columns, kcv)
The averaged FI of the RF model based on 5-fold CV:
rf_reg_vi = rf_reg_vi_df.groupby('variable').imp.mean()
plt.figure(figsize=(15, 8))
rf_reg_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("RF - Variable Importance")
plt.show()
# sorted variables - according to our RF model
rf_reg_sorted_vars = rf_reg_vi.sort_values(ascending = False).index.tolist()
shap.initjs()
explainer = shap.TreeExplainer(rf_reg_grid_search.best_estimator_)
shap_values = explainer.shap_values(X)
Applying the SHAP method on the regression task, we can examine the effect of the different features on a single observation. The magnitude of the variable represents its importance, where red and blue colors represent a positive and negative effect, respectively.
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
Moreover, we can examine the FI for a group of observations, ordered by similarity.
X_sample = X.sample(2000, random_state = 1).copy()
shap.force_plot(explainer.expected_value, shap_values[X_sample.index.values], X_sample)
The FI of SHAP on the full data can be represented in the following plots
shap.summary_plot(shap_values, X)
shap.summary_plot(shap_values, X, plot_type="bar")
rf_reg_vi_shap_df = compute_shap_vi(rf_reg_grid_search.best_estimator_,
X.values, y.values, X.columns, kcv, reg = True)
Finally, the averaged FI of the SHAP method computed over the 5 folds:
rf_reg_shap_vi = rf_reg_vi_shap_df.groupby('variable').imp_sum.mean()
plt.figure(figsize=(15, 8))
rf_reg_shap_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("RF SHAP (regression) - Variable Importance")
plt.show()
sorted_rf_reg_shap_vars = rf_reg_shap_vi.sort_values(ascending = False).index.tolist()
params = {
# 'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.7, 0.85, 1.0],
'max_depth': [3, 5, 7],
'learning_rate':[.001, .01, 0.1]
}
xgb = XGBRegressor(n_estimators = NTREE)
grid_xgb_reg = GridSearchCV(xgb,
param_grid = params,
cv = kcv, verbose = 1, n_jobs = -1)
grid_xgb_reg = grid_xgb_reg.fit(X, y)
print(grid_xgb_reg.best_params_)
vi_xgb_reg_df = compute_vi(grid_xgb_reg.best_estimator_, X.values, y.values, X.columns, kcv)
xgb_vi = vi_xgb_reg_df.groupby('variable').imp.mean()
sorted_xgb_vars = xgb_vi.sort_values(ascending = False).index.tolist()
The averaged FI of xgboost over the 5 folds:
plt.figure(figsize=(15, 8))
xgb_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("XGBoost regression - Variable Importance")
plt.show()
explainer = shap.TreeExplainer(grid_xgb_reg.best_estimator_)
shap_values = explainer.shap_values(X.values)
shap.summary_plot(shap_values, X.values, plot_type="bar")
xgb_reg_vi_shap_df = compute_shap_vi(grid_xgb_reg.best_estimator_,
X.values, y.values, X.columns, kcv, reg = True)
The averaged FI of SHAP on xgboost:
xgb_shap_vi = xgb_reg_vi_shap_df.groupby('variable').imp_sum.mean()
plt.figure(figsize=(15, 8))
xgb_shap_vi.nlargest(15).sort_values().plot(kind = "barh", color = "#7aadff")
plt.title("XGBoost SHAP (regression) - Variable Importance")
plt.show()
sorted_xgb_reg_shap_vars = xgb_shap_vi.sort_values(ascending = False).index.tolist()
interpreter = Interpretation(X, feature_names=X.columns)
model = InMemoryModel(rf_reg_grid_search.best_estimator_.predict, examples=X)
perm_reg_feature_importance = interpreter.feature_importance.feature_importance(model)
plt.figure(figsize=(15, 8))
perm_reg_feature_importance.sort_values(ascending=True).plot(kind = 'barh')
sorted_perm_reg_vars = perm_reg_feature_importance.sort_values(ascending=False).index
wv_reg_vi = pd.read_csv("aws_outputs/mean_VI_reg.txt", header=None)
wv_reg_vi.columns = ['vi']
wv_reg_vi_series = pd.Series(wv_reg_vi.values.reshape(-1), index = X.columns)
wv_reg_vi.head()
wv_reg_vi_series.sort_values(ascending=True).plot(kind = 'barh')
sorted_wv_reg_vi = wv_reg_vi_series.sort_values(ascending=False).index
full_vi_reg_df = pd.DataFrame({'variable':X.columns,
'rf':rf_reg_vi[X.columns],
'rf_shap':rf_reg_shap_vi[X.columns],
'xgb':xgb_vi[X.columns],
'xgb_shap':xgb_shap_vi[X.columns],
'perm':perm_reg_feature_importance[X.columns],
'wv_rf':wv_reg_vi_series[X.columns]}).reset_index(drop = True)
full_vi_reg_df.head()
full_vi_reg_df.to_csv("results/reg_vi.csv", index=False)
model_order = {'rf':rf_reg_sorted_vars,
'rf_shap':sorted_rf_reg_shap_vars,
'xgb':sorted_xgb_vars,
'xgb_shap':sorted_xgb_reg_shap_vars,
'perm':sorted_perm_reg_vars,
'wv_rf':sorted_wv_reg_vi
}
# scoring metrics
scoring = {'MSE':'neg_mean_squared_error',
'MAE':'neg_mean_absolute_error'}
# compute vi score for all orderings
vi_score_rf_df = compare_orderings(rf_reg_grid_search.best_estimator_, model_order,
X,y,kcv,
ks = ks,
scoring = scoring)
vi_score_rf_df['MSE'] = - vi_score_rf_df['test_MSE']
vi_score_rf_df['MAE'] = - vi_score_rf_df['test_MAE']
RF model based on the top k features chosen by all FI methods
plt.style.use('seaborn-deep')
plot_vi(vi_score_rf_df, 'MSE', title = "RF - Regression")
# compute vi score for all orderings
vi_score_svm_df = compare_orderings(svm.SVR(kernel='rbf', gamma = .01),
model_order,
X, y, kcv,
ks = ks,
scoring = scoring)
SVM model based on the top k features chosen by all FI methods
vi_score_svm_df['MSE'] = - vi_score_svm_df['test_MSE']
vi_score_svm_df['MAE'] = - vi_score_svm_df['test_MAE']
plt.style.use('seaborn-talk')
plot_vi(vi_score_svm_df, 'MSE', title = "SVM - Regression")
# compute vi score for all orderings
vi_score_linreg_df = compare_orderings(LinearRegression(),
model_order,
X, y, kcv,
ks = ks,
scoring = scoring)
Linear regression model based on the top k features chosen by all FI methods
vi_score_linreg_df['MSE'] = - vi_score_linreg_df['test_MSE']
vi_score_linreg_df['MAE'] = - vi_score_linreg_df['test_MAE']
plt.style.use('seaborn-talk')
plot_vi(vi_score_linreg_df, 'MSE', title = "Linear Regression")
vi_score_linreg_df.head()
rename = {'model':'method'}
vi_score_rf_df.rename(columns = rename, inplace = True)
vi_score_rf_df['model'] = 'RF'
vi_score_svm_df.rename(columns = rename, inplace = True)
vi_score_svm_df['model'] = 'SVM'
vi_score_linreg_df.rename(columns = rename, inplace = True)
vi_score_linreg_df['model'] = 'LR'
columns = ['model', 'method', 'k', 'MAE', 'MSE']
reg_results_full = pd.concat((vi_score_rf_df[columns],
vi_score_svm_df[columns],
vi_score_linreg_df[columns]), ignore_index=True)
reg_results_full.head()
reg_results_full.to_csv("results/reg_results_full.csv", index=False)
reg_df = pd.melt(reg_results_full, id_vars=['model', 'method', 'k'], var_name='metric')
reg_df['task'] = 'reg'
clf_df = pd.melt(clf_results_full, id_vars=['model', 'method', 'k'], var_name='metric')
clf_df['task'] = 'clf'
full_results = pd.concat((reg_df, clf_df), ignore_index=True)
full_results.head()
full_results.to_csv("results/full_results.csv", index=False)
full_vi_reg_df.head()
full_vi_clf_df.head()
full_vi_reg_df['task'] = 'reg'
full_vi_clf_df['task'] = 'clf'
vi_full = pd.concat((full_vi_reg_df, full_vi_clf_df), ignore_index=True)
vi_full.head()
vi_full.to_csv("results/full_vi.csv", index=False)